Spatial Data

Francisco Rowe (@fcorowe)

2022-06-26

Fundamental Geographic Data Structures

Three main structures are generally used to organise geographic data:

  1. Vector data structures: Vector data structures record geographic information using points, lines and polygons in a geographic table. Rows represent individual geographic objects, and columns store the attributes (or features) of those objects.

  2. Raster data structures: Raster data structures record geographic data in a uniform way over space in the form of grids. They divide geographic surfaces into cells of constant size. Rows and columns provide information about the geographic location of each cell.

  3. Spatial graphs: Spatial graphs store connections between objects through space. These connections may derive from geographical topology (e.g. contiguity), distance, or more sophisticated dimensions, such as interaction flows (e.g. human mobility, trade and information).

Vector data structures tend to dominate the social sciences, as the interest is often in capturing discrete geographic units containing populations. We therefore focus on vector data structures here.
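To make the three structures concrete, the following is a minimal base R sketch. All object names and values are made up for illustration, and plain data frames and matrices stand in for the dedicated spatial classes a real analysis would use.

```r
# Vector: one row per geographic object, with attribute columns and,
# for these toy points, a pair of coordinates
vector_pts <- data.frame(
  id         = c("a", "b", "c"),
  population = c(120, 80, 45),
  x          = c(0.5, 1.5, 2.5),
  y          = c(0.5, 0.5, 1.5)
)

# Raster: a regular grid of equally sized cells; the position of a value
# in the matrix (row, column) encodes its location
raster_grid <- matrix(c(10, 12, 15,
                        11, 14, 18),
                      nrow = 2, ncol = 3, byrow = TRUE)

# Spatial graph: an adjacency matrix recording which objects are connected
# (here, hypothetically, a-b and b-c are neighbours)
adjacency <- matrix(0, nrow = 3, ncol = 3,
                    dimnames = list(vector_pts$id, vector_pts$id))
adjacency["a", "b"] <- adjacency["b", "a"] <- 1
adjacency["b", "c"] <- adjacency["c", "b"] <- 1

vector_pts
raster_grid
adjacency
```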

Vector data

To understand the structure of vector data, let’s read a dataset (Liverpool_OA.shp) describing output areas within Liverpool in the United Kingdom. To read in the data, we use the function st_read() from the package sf. sf supports geometry collections, which can contain multiple geometry types in a single object. sf provides the same functionality previously provided by three separate packages: sp, rgdal and rgeos (Robin et al. 2021).

For raster data, I would recommend using the package terra.

library(sf)
oa_shp <- st_read("./data/Liverpool_OA.shp")
## Reading layer `Liverpool_OA' from data source 
##   `/Users/franciscorowe/Dropbox/Francisco/Research/github_projects/courses/intro-gds/data/Liverpool_OA.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1584 features and 18 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 332390.2 ymin: 379748.5 xmax: 345636 ymax: 397980.1
## Projected CRS: Transverse_Mercator

We have read in an sf data frame containing spatial and attribute columns. We can examine the content of the data frame by using the function head(). Here we display the first four columns. The last column in this example contains the geographic information, i.e. the geometry.

class(oa_shp)
## [1] "sf"         "data.frame"
head(oa_shp[,1:4])
## Simple feature collection with 6 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 335071.6 ymin: 389876.7 xmax: 339426.9 ymax: 394479
## Projected CRS: Transverse_Mercator
##       OA_CD   LSOA_CD   MSOA_CD    LAD_CD                       geometry
## 1 E00176737 E01033761 E02006932 E08000012 MULTIPOLYGON (((335106.3 38...
## 2 E00033515 E01006614 E02001358 E08000012 MULTIPOLYGON (((335810.5 39...
## 3 E00033141 E01006546 E02001365 E08000012 MULTIPOLYGON (((336738 3931...
## 4 E00176757 E01006646 E02001369 E08000012 MULTIPOLYGON (((335914.5 39...
## 5 E00034050 E01006712 E02001375 E08000012 MULTIPOLYGON (((339325 3914...
## 6 E00034280 E01006761 E02001366 E08000012 MULTIPOLYGON (((338198.1 39...

Each row represents an output area. Each output area has multiple attributes (i.e. columns): administrative area codes and geometry, as well as information on the local population in these areas. The population information is not displayed above (can you access it?).

The content of the geometry column gives sf objects their spatial powers. oa_shp$geometry is a ‘list column’ that contains all the coordinates of the output area polygons. sf objects can be plotted quickly with the base R function plot().

plot(oa_shp$geometry)
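The same anatomy can be seen on a small object built by hand. The sketch below assumes the sf package is installed and uses a made-up two-polygon dataset (toy, area_id and population are illustrative names, not part of the Liverpool data) to show how attribute columns and the geometry list column sit side by side.

```r
library(sf)

# Two adjacent unit squares; each polygon ring must close on its first vertex
p1 <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
p2 <- st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))

# An sf object is a data frame with attribute columns plus a geometry column
toy <- st_sf(
  area_id    = c("A", "B"),
  population = c(120, 80),
  geometry   = st_sfc(p1, p2)
)

class(toy)                      # an sf object that is also a data.frame
names(toy)                      # attribute columns and the geometry column
attrs <- st_drop_geometry(toy)  # attributes only, as a plain data frame
geoms <- st_geometry(toy)       # the geometry list column on its own
plot(geoms)
```

The same calls, names() and st_drop_geometry(), work on any sf data frame and are one way to inspect columns that head() does not show.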

Spatial Data is Special

Traditional data

Attributes:

Challenges:

New forms of data

Rowe, F. 2021. Big Data and Human Geography. In: Demeritt, D. and Lees, L. (eds) Concise Encyclopedia of Human Geography. Edward Elgar Encyclopedias in the Social Sciences series.

Spatial Data types

Rowe, F. and Arribas-Bel, D. 2021. Spatial Modelling for Data Scientists.

Different classifications of spatial data types exist. Knowing the structure of the data at hand is important for choosing appropriate analytical methods.

Fig. 1. Data Types. Area / Lattice data source: Önnerfors et al. (2019). Point data source: Tao et al. (2018). Flow data source: Rowe and Patias (2020). Trajectory data source: Kwan and Lee (2004).

Lattice/Areal Data

Point Data

Flow Data

Trajectory Data

Hierarchical Structure of Data

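Smaller geographical units are organised within larger geographical units. In the Liverpool data, for example, each output area (OA_CD) sits within a Lower Layer Super Output Area (LSOA_CD), which sits within a Middle Layer Super Output Area (MSOA_CD), which in turn sits within a local authority district (LAD_CD).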

## Simple feature collection with 6 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 335071.6 ymin: 389876.7 xmax: 339426.9 ymax: 394479
## Projected CRS: Transverse_Mercator
##       OA_CD   LSOA_CD   MSOA_CD    LAD_CD                       geometry
## 1 E00176737 E01033761 E02006932 E08000012 MULTIPOLYGON (((335106.3 38...
## 2 E00033515 E01006614 E02001358 E08000012 MULTIPOLYGON (((335810.5 39...
## 3 E00033141 E01006546 E02001365 E08000012 MULTIPOLYGON (((336738 3931...
## 4 E00176757 E01006646 E02001369 E08000012 MULTIPOLYGON (((335914.5 39...
## 5 E00034050 E01006712 E02001375 E08000012 MULTIPOLYGON (((339325 3914...
## 6 E00034280 E01006761 E02001366 E08000012 MULTIPOLYGON (((338198.1 39...
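A nested structure like this means that values for small units can be summed up the hierarchy. The sketch below uses a made-up lookup table: the column names OA_CD, LSOA_CD and MSOA_CD mirror the output above, but the codes and the population column are hypothetical.

```r
# Hypothetical lookup: six output areas nested in three LSOAs and two MSOAs
lookup <- data.frame(
  OA_CD      = paste0("OA", 1:6),
  LSOA_CD    = rep(c("LSOA1", "LSOA2", "LSOA3"), each = 2),
  MSOA_CD    = c(rep("MSOA1", 4), rep("MSOA2", 2)),
  population = c(310, 290, 275, 325, 300, 280)
)

# Check the nesting: every LSOA should map to exactly one MSOA
msoas_per_lsoa <- tapply(lookup$MSOA_CD, lookup$LSOA_CD,
                         function(v) length(unique(v)))
nested <- all(msoas_per_lsoa == 1)

# Aggregate upwards: output areas to LSOAs, then LSOAs to MSOAs
lsoa_pop <- aggregate(population ~ LSOA_CD + MSOA_CD, data = lookup, FUN = sum)
msoa_pop <- aggregate(population ~ MSOA_CD, data = lsoa_pop, FUN = sum)

lsoa_pop
msoa_pop
```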

Key Challenges

Major challenges exist when working with spatial data.

Modifiable Areal Unit Problem (MAUP)

The MAUP represents a challenge that has troubled geographers for decades.

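Two aspects of the MAUP are normally recognised in empirical analysis: the scale effect, where results change with the size of the areal units used, and the zonation effect, where results change with how boundaries are drawn at a given scale.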

Fig. 2. MAUP effect: (a) scale effect; and (b) zonation effect. Source: Loidl et al. (2016).

Loidl, M., Wallentin, G., Wendel, R. and Zagel, B., 2016. Mapping bicycle crash risk patterns on the local scale. Safety, 2(3), p.17.

The MAUP can greatly affect our results and our capacity to make inferences, leading to wrong conclusions.

Solutions?

No solution!

Potential mitigation strategies:

  • Analysis at different geographical scales
  • Use the smallest geography available, create random aggregations, and assess changes to the results
  • Use functional areas
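The first two strategies can be sketched on simulated data. The example below is illustrative only: it assumes a one-dimensional strip of 120 small units with two made-up variables, and the helper functions zone_cor() and random_zone_cor() are hypothetical names written for this sketch. It compares the correlation between the variables at several aggregation sizes and across random contiguous aggregations.

```r
set.seed(42)

# Simulated values for 120 small units along a strip
n_units <- 120
x <- rnorm(n_units)
y <- 0.3 * x + rnorm(n_units)

# Correlation after averaging consecutive units into zones of a fixed size
zone_cor <- function(x, y, size) {
  zone <- (seq_along(x) - 1) %/% size
  cor(as.vector(tapply(x, zone, mean)), as.vector(tapply(y, zone, mean)))
}

# Correlation after averaging into n_zones contiguous zones whose
# boundaries are drawn at random
random_zone_cor <- function(x, y, n_zones) {
  breaks <- sort(sample(seq_len(length(x) - 1), n_zones - 1))
  zone <- findInterval(seq_along(x), breaks + 1)
  cor(as.vector(tapply(x, zone, mean)), as.vector(tapply(y, zone, mean)))
}

# Analysis at different scales: zones of 1, 2, 4 and 8 units
scale_effect <- sapply(c(1, 2, 4, 8), function(s) zone_cor(x, y, s))

# Random aggregations at one scale: 200 alternative sets of 30 zones
random_cors <- replicate(200, random_zone_cor(x, y, 30))

scale_effect
range(random_cors)
```

A wide spread in random_cors, or large changes across scale_effect, would suggest that conclusions depend on the zoning system rather than on the underlying process.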

Ecological Fallacy

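An error in the interpretation of statistical data in which conclusions about individuals are inferred from aggregate information, e.g. assuming that a relationship observed across areas also holds for the individuals living in them.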

Robinson, W.S., 2009. Ecological Correlations and the Behavior of Individuals. International Journal of Epidemiology, 38(2), pp.337–341.
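A constructed toy example can show how far area-level and individual-level relationships may diverge. The data below are entirely made up: ten areas with four individuals each, where individual deviations from the area mean are designed to be uncorrelated.

```r
# Ten hypothetical areas, four individuals per area
area <- rep(1:10, each = 4)

# Individual deviations from the area mean, built so that the deviations
# in x and y are uncorrelated and average to zero within each area
e_x <- rep(c(-10,  10, -10, 10), times = 10)
e_y <- rep(c(-10, -10,  10, 10), times = 10)

people <- data.frame(area = area, x = area + e_x, y = area + e_y)

# Individual-level correlation: weak (about 0.08)
cor_individual <- cor(people$x, people$y)

# Area-level correlation of the area means: perfect by construction
area_means <- aggregate(cbind(x, y) ~ area, data = people, FUN = mean)
cor_area <- cor(area_means$x, area_means$y)

c(individual = cor_individual, area = cor_area)
```

Reading the area-level correlation as a statement about individuals would greatly overstate the strength of the relationship in this example.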

Spatial Dependence

Refers to the spatial association of values for an indicator: observations that are close together in space tend to be more similar (or less similar) than would be expected for randomly associated pairs of observations.
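One common summary of spatial dependence is Moran's I, which is not discussed above and is swapped in here purely as an illustration. The base R sketch below assumes a 5 x 5 grid of cells with rook contiguity (cells sharing an edge are neighbours); morans_i() is a helper written for this example rather than a library function.

```r
# A 5 x 5 grid of cells identified by column and row
n_side <- 5
coords <- expand.grid(col = 1:n_side, row = 1:n_side)

# Rook contiguity: cells sharing an edge are at Manhattan distance 1
d <- as.matrix(dist(coords, method = "manhattan"))
w <- (d == 1) * 1
w <- w / rowSums(w)  # row-standardise so each row of weights sums to 1

# Moran's I for values x and spatial weights matrix w
morans_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# A smooth gradient: neighbouring cells hold similar values
gradient <- coords$col + coords$row
# A checkerboard: neighbouring cells always hold opposite values
checker <- (coords$col + coords$row) %% 2

i_gradient <- morans_i(gradient, w)  # positive spatial dependence
i_checker  <- morans_i(checker, w)   # negative spatial dependence

c(gradient = i_gradient, checkerboard = i_checker)
```

Values near zero would indicate no spatial association. In applied work a dedicated package such as spdep would normally be used, including for significance testing.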

Spatial Heterogeneity

Refers to the uneven distribution of a variable’s values across space.

Spatial Nonstationarity

It refers to variations in the relationship between an outcome variable and a set of predictor variables across space.
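A small simulation can illustrate the idea. In the sketch below, which uses made-up data and hypothetical variable names, the effect of x on y is positive in the western half of a study area and negative in the eastern half, so a single global regression hides two different local relationships.

```r
set.seed(1)

# Simulated locations and a predictor for 100 observations
n <- 100
easting <- runif(n, 0, 10)
x <- runif(n, 0, 10)

# The true slope depends on location: 2 in the west, -1 in the east
region <- ifelse(easting < 5, "west", "east")
true_slope <- ifelse(region == "west", 2, -1)
y <- true_slope * x + rnorm(n, sd = 0.1)

dat <- data.frame(easting, x, y, region)

# A single global model averages over the two regimes
global_slope <- coef(lm(y ~ x, data = dat))[["x"]]

# Separate models per region recover the local relationships
local_slopes <- sapply(split(dat, dat$region),
                       function(d) coef(lm(y ~ x, data = d))[["x"]])

global_slope
local_slopes
```

Fitting separate models by region is only the simplest way to expose such variation; it assumes the regional boundary is known in advance.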